{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Python批量翻译英语单词"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***用途：***   \n",
    "对批量的英语文本，生成英语-汉语翻译的单词本，提供Excel下载\n",
    "\n",
    "***本代码实现：***\n",
    "1. 提供一个英文文章URL，自动下载网页；\n",
    "2. 实现网页中所有英语单词的翻译；\n",
    "3. 下载翻译结果的Excel\n",
    "\n",
    "***涉及技术：***\n",
    "1. pandas的读取csv、多数据merge、输出Excel\n",
    "2. requests库下载HTML网页\n",
    "3. BeautifulSoup解析HTML网页\n",
    "4. Python正则表达式实现英文分词"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. 读取英语-汉语翻译词典文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "词典文件来自：https://github.com/skywind3000/ECDICT\n",
    "使用步骤：\n",
    "1. 下载代码打包：https://github.com/skywind3000/ECDICT/archive/master.zip\n",
    "2. 解压master.zip，然后解压其中的‪stardict.csv文件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "d:\\Anaconda3\\lib\\site-packages\\IPython\\core\\interactiveshell.py:3063: DtypeWarning: Columns (11) have mixed types.Specify dtype option on import or set low_memory=False.\n",
      "  interactivity=interactivity, compiler=compiler, result=result)\n"
     ]
    }
   ],
   "source": [
    "# 注意：stardict.csv的地址需要替换成你自己的文件地址\n",
    "df_dict = pd.read_csv(\"D:/tmp/ECDICT-master/stardict.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3402564, 13)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_dict.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>word</th>\n",
       "      <th>phonetic</th>\n",
       "      <th>definition</th>\n",
       "      <th>translation</th>\n",
       "      <th>pos</th>\n",
       "      <th>collins</th>\n",
       "      <th>oxford</th>\n",
       "      <th>tag</th>\n",
       "      <th>bnc</th>\n",
       "      <th>frq</th>\n",
       "      <th>exchange</th>\n",
       "      <th>detail</th>\n",
       "      <th>audio</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3370509</th>\n",
       "      <td>WWDH</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[网络] 淇楄壋</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>518014</th>\n",
       "      <td>chauhtan (chotan)</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>卓丹</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>389953</th>\n",
       "      <td>breviarist</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[网络] 短笛师</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>951231</th>\n",
       "      <td>electric-vehicle</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>abbr. “EV”的变体；“electric car”的变体\\n[网络] 电动汽车</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>91258</th>\n",
       "      <td>Albionian</td>\n",
       "      <td>æl'biәniәn</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[地质]阿尔比翁期</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                      word    phonetic definition  \\\n",
       "3370509               WWDH         NaN        NaN   \n",
       "518014   chauhtan (chotan)         NaN        NaN   \n",
       "389953          breviarist         NaN        NaN   \n",
       "951231    electric-vehicle         NaN        NaN   \n",
       "91258            Albionian  æl'biәniәn        NaN   \n",
       "\n",
       "                                        translation  pos  collins  oxford  \\\n",
       "3370509                                    [网络] 淇楄壋  NaN      NaN     NaN   \n",
       "518014                                           卓丹  NaN      NaN     NaN   \n",
       "389953                                     [网络] 短笛师  NaN      NaN     NaN   \n",
       "951231   abbr. “EV”的变体；“electric car”的变体\\n[网络] 电动汽车  NaN      NaN     NaN   \n",
       "91258                                     [地质]阿尔比翁期  NaN      NaN     NaN   \n",
       "\n",
       "         tag  bnc  frq exchange detail  audio  \n",
       "3370509  NaN  NaN  NaN      NaN    NaN    NaN  \n",
       "518014   NaN  NaN  NaN      NaN    NaN    NaN  \n",
       "389953   NaN  NaN  NaN      NaN    NaN    NaN  \n",
       "951231   NaN  NaN  NaN      NaN    NaN    NaN  \n",
       "91258    NaN  0.0  0.0      NaN    NaN    NaN  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_dict.sample(10).head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>word</th>\n",
       "      <th>translation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>'a</td>\n",
       "      <td>na. 一\\nn. 英文字母表的第一字母；【乐】A音\\nart. 冠以不定冠词主要表示类别\\...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>'A' game</td>\n",
       "      <td>[网络] 游戏；一个游戏；一局</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>'Abbāsīyah</td>\n",
       "      <td>[地名] 阿巴西耶 ( 埃 )</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>'Abd al Kūrī</td>\n",
       "      <td>[地名] 阿卜杜勒库里岛 ( 也门 )</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>'Abd al Mājid</td>\n",
       "      <td>[地名] 阿卜杜勒马吉德 ( 苏丹 )</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            word                                        translation\n",
       "0             'a  na. 一\\nn. 英文字母表的第一字母；【乐】A音\\nart. 冠以不定冠词主要表示类别\\...\n",
       "1       'A' game                                    [网络] 游戏；一个游戏；一局\n",
       "2     'Abbāsīyah                                    [地名] 阿巴西耶 ( 埃 )\n",
       "3   'Abd al Kūrī                                [地名] 阿卜杜勒库里岛 ( 也门 )\n",
       "4  'Abd al Mājid                                [地名] 阿卜杜勒马吉德 ( 苏丹 )"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 把word、translation之外的列扔掉\n",
    "df_dict = df_dict[[\"word\", \"translation\"]]\n",
    "df_dict.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. 下载网页，得到网页内容"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pandas官方文档中的一个URL\n",
    "url = \"https://pandas.pydata.org/docs/user_guide/indexing.html\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "html_cont = requests.get(url).text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\n\\n<!DOCTYPE html>\\n\\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\\n  <head>\\n    <meta charset=\"utf-8\" />'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "html_cont[:100]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. 提取HTML的正文内容\n",
    "即：去除HTML标签，获取正文"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html_cont)\n",
    "html_text = soup.get_text()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\n\\n\\nIndexing and selecting data — pandas 1.0.1 documentation\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nMathJax.Hub.Config({\"tex2jax\": {\"inlineMath\": [[\"$\", \"$\"], [\"\\\\\\\\(\", \"\\\\\\\\)\"]], \"processEscapes\": true, \"ignoreClass\": \"document\", \"processClass\": \"math|output_area\"}})\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nHome\\n\\n\\nWhat\\'s New in 1.0.0\\n\\n\\nGetting started\\n\\n\\nUser Guide\\n\\n\\nAPI reference\\n\\n\\nDevelopment\\n\\n\\nRelease Notes\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nIO tools (text, CSV, HDF5, â\\x80¦)\\n\\n\\nIndexing and selecting data\\n\\n\\nMultiIndex / advanced indexing\\n\\n\\nMerge, join, a'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "html_text[:500]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. 英文分词和数据清洗"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['', '', '', 'Indexing', 'and', 'selecting', 'data', '—', 'pandas', '1']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 分词\n",
    "import re\n",
    "word_list = re.split(\"\"\"[ ,.\\(\\)/\\n|\\-:=\\$\\[\"']\"\"\",html_text)\n",
    "word_list[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['',\n",
       " 'itself',\n",
       " 'showed',\n",
       " 'throughout',\n",
       " 'pointed',\n",
       " 'n',\n",
       " 'against',\n",
       " 'name',\n",
       " 'none',\n",
       " 'ran']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 读取停用词表，从网上复制的，位于当前目录下\n",
    "with open(\"./datas/stop_words/stop_words.txt\") as fin:\n",
    "    stop_words=set(fin.read().split(\"\\n\"))\n",
    "list(stop_words)[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['indexing',\n",
       " 'selecting',\n",
       " 'data',\n",
       " 'pandas',\n",
       " 'documentation',\n",
       " 'mathjax',\n",
       " 'hub',\n",
       " 'config',\n",
       " 'tex2jax',\n",
       " 'inlinemath',\n",
       " '\\\\\\\\',\n",
       " '\\\\\\\\',\n",
       " ']]',\n",
       " 'processescapes',\n",
       " 'true',\n",
       " 'ignoreclass',\n",
       " 'document',\n",
       " 'processclass',\n",
       " 'math',\n",
       " 'output_area']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 数据清洗\n",
    "word_list_clean = []\n",
    "for word in word_list:\n",
    "    word = str(word).lower().strip()\n",
    "    # 过滤掉空词、数字、单个字符的词、停用词\n",
    "    if not word or word.isnumeric() or len(word)<=1 or word in stop_words:\n",
    "        continue\n",
    "    word_list_clean.append(word)\n",
    "word_list_clean[:20]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. 分词结果构造成一个DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>word</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>indexing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>selecting</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>data</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>pandas</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>documentation</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            word\n",
       "0       indexing\n",
       "1      selecting\n",
       "2           data\n",
       "3         pandas\n",
       "4  documentation"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_words = pd.DataFrame({\n",
    "    \"word\": word_list_clean\n",
    "})\n",
    "df_words.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(4915, 1)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_words.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>word</th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>620</th>\n",
       "      <td>df</td>\n",
       "      <td>161</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>659</th>\n",
       "      <td>dtype</td>\n",
       "      <td>87</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1274</th>\n",
       "      <td>true</td>\n",
       "      <td>86</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>593</th>\n",
       "      <td>dataframe</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1038</th>\n",
       "      <td>pd</td>\n",
       "      <td>75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>917</th>\n",
       "      <td>loc</td>\n",
       "      <td>72</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>970</th>\n",
       "      <td>nan</td>\n",
       "      <td>72</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>721</th>\n",
       "      <td>false</td>\n",
       "      <td>58</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>914</th>\n",
       "      <td>list</td>\n",
       "      <td>58</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>835</th>\n",
       "      <td>indexing</td>\n",
       "      <td>53</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           word  count\n",
       "620          df    161\n",
       "659       dtype     87\n",
       "1274       true     86\n",
       "593   dataframe     80\n",
       "1038         pd     75\n",
       "917         loc     72\n",
       "970         nan     72\n",
       "721       false     58\n",
       "914        list     58\n",
       "835    indexing     53"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 统计词频\n",
    "df_words = (\n",
    "    df_words\n",
    "    .groupby(\"word\")[\"word\"]\n",
    "    .agg(count=\"size\")\n",
    "    .reset_index()\n",
    "    .sort_values(by=\"count\", ascending=False)\n",
    ")\n",
    "df_words.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6. 和单词词典实现merge"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_merge = pd.merge(\n",
    "    left = df_dict,\n",
    "    right = df_words,\n",
    "    left_on = \"word\",\n",
    "    right_on = \"word\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>word</th>\n",
       "      <th>translation</th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>658</th>\n",
       "      <td>team</td>\n",
       "      <td>n. 队, 组\\nvt. 把马(牛)套在同一辆车上, 把...编成一组\\nvi. 驾驶卡车, 协作</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>523</th>\n",
       "      <td>providing</td>\n",
       "      <td>conj. 以...为条件, 假如</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>394</th>\n",
       "      <td>lines</td>\n",
       "      <td>n. 台词</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>118</th>\n",
       "      <td>columns</td>\n",
       "      <td>塔器</td>\n",
       "      <td>49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>136</th>\n",
       "      <td>conforms</td>\n",
       "      <td>v. 遵守( conform的第三人称单数 ); 顺应; 相一致; 相符合</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>529</th>\n",
       "      <td>python</td>\n",
       "      <td>n. 大蟒, 巨蟒\\n[计] Python 程序设计语言；人生苦短，我用 Python</td>\n",
       "      <td>26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>185</th>\n",
       "      <td>determine</td>\n",
       "      <td>v. 决定, 决心</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>285</th>\n",
       "      <td>forward</td>\n",
       "      <td>a. 向前的, 早的, 迅速的, 在前的, 进步的\\nvt. 促进...的生长, 转寄, 运...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49</th>\n",
       "      <td>arguments</td>\n",
       "      <td>n. 参数</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>564</th>\n",
       "      <td>reported</td>\n",
       "      <td>a. 报告的；据报道的</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          word                                        translation  count\n",
       "658       team  n. 队, 组\\nvt. 把马(牛)套在同一辆车上, 把...编成一组\\nvi. 驾驶卡车, 协作      3\n",
       "523  providing                                  conj. 以...为条件, 假如      1\n",
       "394      lines                                              n. 台词      1\n",
       "118    columns                                                 塔器     49\n",
       "136   conforms              v. 遵守( conform的第三人称单数 ); 顺应; 相一致; 相符合      1\n",
       "529     python        n. 大蟒, 巨蟒\\n[计] Python 程序设计语言；人生苦短，我用 Python     26\n",
       "185  determine                                          v. 决定, 决心      1\n",
       "285    forward  a. 向前的, 早的, 迅速的, 在前的, 进步的\\nvt. 促进...的生长, 转寄, 运...      1\n",
       "49   arguments                                              n. 参数      3\n",
       "564   reported                                        a. 报告的；据报道的      1"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_merge.sample(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(718, 3)"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_merge.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7. 存入Excel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_merge.to_excel(\"./38. batch_chinese_english.xlsx\", index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 后续升级：\n",
    "1. 可以提供txt/excel/word/pdf的批量输入，生成单词本；\n",
    "2. 可以做成网页、微信小程序的形式，在线访问和使用\n",
    "3. 用户可以标记或上传“已经认识的词语”，每次过滤掉"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
